Abstract

In this paper, we focus on a novel knowledge reuse scenario
where the knowledge in the source schema needs to be
translated to a semantically heterogeneous target schema. We
refer to this task as “knowledge translation” (KT). Unlike data
translation and transfer learning, KT does not require any data
from the source or target schema. We adopt a probabilistic
approach to KT by representing the knowledge in the source
schema, the mapping between the source and target schemas,
and the resulting knowledge in the target schema all as
probability distributions, specifically using Markov random
fields and Markov logic networks. Given the source knowledge and
mappings, we use standard learning and inference algorithms
for probabilistic graphical models to find an explicit
probability distribution in the target schema that minimizes the
Kullback-Leibler divergence from the implicit distribution.
This gives us a compact probabilistic model that represents
knowledge from the source schema as well as possible,
respecting the uncertainty in both the source knowledge and the
mapping. In experiments on both propositional and relational
domains, we find that the knowledge obtained by KT is
comparable to that obtained by other approaches that require
data, demonstrating that knowledge can be reused without data.

Introduction 

Knowledge acquisition is a critical process for building
predictive or descriptive models for many applications.
When domain expertise is available, knowledge can be
constructed manually. When enough high-quality data is
available, knowledge can be constructed automatically using data
mining or machine learning tools. Both approaches can be
difficult and expensive, so we would prefer to reuse or
transfer knowledge from one application or system to another
whenever possible. However, different applications or
systems often have different semantics, which makes
knowledge reuse or transfer a non-trivial task.

As a motivating example, suppose a new credit card
company without historical data wants to use the classification
model mined by a partner credit card company to determine
whether the applicants of the new company are qualified or
not. Since the two companies may use different schemas to
store their applicants’ data (e.g., in one schema, we have
annual income recorded as a numerical attribute, while in the
other, we have salary as an attribute with discretized ranges),
we cannot simply reuse the old classifier. Due to privacy
and scalability concerns, we cannot transfer the partner
company’s data to the new schema either. Therefore, we
want to translate the classification model itself to the new
schema, without using any data.

In this paper, we propose knowledge translation (KT) as
a novel solution to translate knowledge across conceptually
similar but semantically heterogeneous schemas or
ontologies. For convenience, we refer to them generically as
“schemas.” As shown in the previous example, KT is useful
in situations where data translation/transfer is problematic
due to privacy or scalability concerns.

We formally define knowledge translation as the task of
converting knowledge $K_S$ in source schema $S$ to equivalent
knowledge $K_T$ in target schema $T$, where the
correspondence between the schemas is given by some mapping
$M_{S,T}$. In general, one schema may have concepts that are
more general or specific than the other, so an exact
translation may not exist. We will therefore attempt to find the best
translation, acknowledging that the best translation may still
be a lossy approximation of the source knowledge.

We adopt a probabilistic approach to knowledge
translation, in which the knowledge in the source schema, the
mapping between the source and target schemas, and the
resulting knowledge in the target schema are all represented
as probability distributions. This gives us a consistent
mathematical framework for handling uncertainty at every step
in the process. This uncertainty is clearly necessary when
the source knowledge is probabilistic, but it is also
necessary when there is no exact mapping between the schemas,
or when the correct mapping is uncertain. We propose to
represent these probability distributions using Markov
random fields, for propositional (non-relational) domains, and
Markov logic networks, for relational domains. Given
probability distributions for both the source knowledge and the
schema mapping, we can combine them to define an
implicit probability distribution in the target schema. Our goal
is to find an explicit probability distribution in the target
schema that is close to this implicit distribution in terms of
the Kullback-Leibler divergence.

Our main contributions are: 

• We formally define the problem of knowledge translation
(KT), which allows knowledge to be reused when data is
unavailable.

• We propose a novel probabilistic approach for KT by
combining probabilistic graphical models with schema
mappings.

• We implement an experimental KT system and evaluate
it on two real datasets. We compare our data-free KT
approach to baselines that use data from the source or target
schema and show that we can obtain comparable accuracy
without data.

The paper is organized as follows. We first summarize
related work, such as semantic integration, distributed data
mining, and transfer learning, and discuss their connections
to and distinctions from KT. We then show how Markov
random fields and Markov logic networks can represent
knowledge and mappings with uncertainty. Next, we present a
variant of the MRF/Markov logic learning algorithm to solve the
problem of knowledge translation. We then run experiments
on synthetic and real datasets. Finally, we conclude.

Related Work 

In this section, we compare the task of knowledge translation 
with some related work. 

Semantic Integration. Data integration and exchange
(e.g., (Lenzerini 2002)) are the most studied areas in
semantic integration. The main task of data integration and
exchange is to answer queries posed in terms of the global
schema given source databases. The standard semantics of
global query answering is to return the tuples in every
possible database that is consistent with the global schema
constraints and the mapping, i.e., the set of certain answers.

A main difference between data integration/exchange and
knowledge translation (KT) is that KT has probabilistic
semantics for the translation process. That is, it defines a
distribution over possible worlds in the target schema, instead of
focusing only on the tuples that are in all the possible worlds
(i.e., certain answers).

Distributed Data Mining. Efforts in distributed data
mining (DDM) (see surveys in (Park and Kargupta 2002;
Caragea et al. 2005)) have made considerable progress in
mining distributed data resources without putting data in a
centralized location. Caragea et al. (2005) propose a
general DDM framework with two components: one sends
statistical queries to local data sources, and the other uses the
returned statistics to revise the current partial hypothesis and
generate further queries.

Heterogeneous DDM (Caragea et al. 2005) also handles
the semantic heterogeneity between the global and local
schemas, in particular for attributes whose values have
different granularities, organized as an Attribute Value
Taxonomy (AVT). Heterogeneous DDM requires local data resources
and their mappings to the global schema in order to translate
the statistics of queries. In contrast, KT does not require data
or statistics from either the source or the target. Instead, KT
uses mappings to translate the generated or mined knowledge
from the source directly.


Transfer Learning. Transfer learning (TL) has been a
successful approach to knowledge reuse. In traditional machine
learning, only one domain and one task are involved. When
the amount of data is limited, it is desirable to use data from
related domains or tasks. As long as the source and target
data share some similarity (e.g., in the distribution or
underlying feature representation), such knowledge can be used as
a “prior” for the target task.

Most transfer learning work focuses on the homogeneous
case in which the source and target domains have identical
attributes. The main exceptions are heterogeneous transfer
learning (Yang et al. 2009) and relational transfer
learning (e.g., TAMAR (Mihalkova, Huynh, and Mooney 2007),
deep transfer (Davis and Domingos 2009)). Heterogeneous
transfer learning deals with different representations of the
data (e.g., text and images of an object). While it uses an
implicit mapping between two feature spaces (e.g., through Flickr),
KT uses an explicit mapping via first-order logic (FOL)
formulas. Relational transfer learning deals with two analogous
domains (e.g., in movie and university domains, directors
correspond to professors). In contrast, KT focuses on a single
domain with two different representations. Moreover, relational
transfer learning only handles deterministic one-to-one
matchings, which can be inferred by using a small amount of
target data, while KT does not use any target data and relies on
the provided explicit FOL mapping.

Deductive Knowledge Translation. Deductive knowledge
translation (Dou, Qin, and Liu 2011) essentially tries to
solve the same problem, but it only considers deterministic
knowledge and mappings. Our KT work can handle knowledge
and mappings with uncertainty, which is more general
than the deterministic scenario that deductive knowledge
translation can handle.

See Table 1 for a summary of the similarities and
differences between our knowledge translation (KT) approach and
related work.

Table 1: Comparisons between KT and related work. We 
consider three aspects of a task: whether data is available, 
what kind of knowledge patterns are supported, and what 
kind of mapping is used. 



                   Data avail.     Knowledge       Mapping
Data integration   Source data     Query results   GLAV
Hetero. DDM        Source data     Propositional   AVT
Hetero. TL         Source/target   Any             Implicit
Relational TL      Target data     SRL models      Matching
Deductive KT       No data         FOL             FOL
KT                 No data         SRL models      SRL models


Probabilistic Representations of Knowledge and Mappings

To translate knowledge from one schema to another, we
must have a representation of the knowledge and the
mappings between the two schemas. In many cases, knowledge
and mappings are uncertain. For example, the mined
source knowledge could be a probabilistic model, such as
a Bayesian network. Mappings between two schemas may
also be uncertain, either because a perfect alignment of the
concepts does not exist, or because there is uncertainty about
which alignment is the best. Therefore, we propose a
probabilistic approach to knowledge translation.


Representation of Knowledge 

Our approach to knowledge translation requires that the
source and target knowledge are probability distributions
represented as log-linear models. In some cases, the source
knowledge mined from the data may already be represented
as a log-linear model, such as a Bayesian network used
for fault diagnosis or a Markov logic network modeling
homophily in a social network. In other cases, we will need to
convert the knowledge into this representation.

For mined knowledge represented as rules, including
association rules, rule sets, and decision trees (which can be
viewed as a special case of rule sets), we can construct a
feature for each rule, with a weight corresponding to the
confidence or probability of the rule. The rule weight has
a closed-form solution based on the log odds that the rule is
correct:

$$w_i = \log \frac{p(f_i)}{1 - p(f_i)} - \log \frac{u(f_i)}{1 - u(f_i)}$$

where $p(f_i)$ is the probability or confidence of the $i$th rule
or formula and $u(f_i)$ is its probability under a uniform
distribution. Relational rules in an ontology can similarly be
converted to a Markov logic network by attaching weights
representing their relative strengths or confidences.
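
As a small illustration of the closed-form rule weight, the
following Python sketch computes the difference of the two log
odds. The function name rule_weight and the default u = 0.5 are
assumptions made for this example, not part of the original
system.

```python
import math


def rule_weight(p: float, u: float = 0.5) -> float:
    """Log odds of the rule confidence p minus the log odds of u,
    the rule's probability under a uniform distribution."""
    if not (0.0 < p < 1.0 and 0.0 < u < 1.0):
        raise ValueError("p and u must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p)) - math.log(u / (1.0 - u))


# A rule with confidence 0.9 that is true in half of all worlds
# under the uniform distribution (u = 0.5, an assumed value).
w = rule_weight(0.9, 0.5)
print(round(w, 2))  # 2.2
```

With u = 0.5 the second term vanishes, so the weight reduces to
the plain log odds of the rule confidence.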

For linear classifiers, such as linear support vector
machines or perceptrons, we can substitute logistic regression,
a probabilistic linear classifier.

In some cases, the knowledge we wish to translate takes
the form of a conditional probability distribution, $p(Y \mid X)$,
or a predictive model that can be converted to a conditional
probability distribution. This includes decision trees,
neural networks, and other classifiers used in data mining and
machine learning. The method we propose will rely on a
full joint probability distribution over all variables. We can
convert a conditional distribution into a joint distribution by
assuming some prior distribution over the evidence, $p(X)$,
such as a uniform distribution.
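
The conversion from a conditional to a joint distribution can be
sketched in a few lines of Python. The table below is a
hypothetical classifier output invented for illustration, and the
function name joint_from_conditional is likewise an assumption.

```python
from itertools import product


def joint_from_conditional(cond, x_values, y_values, prior=None):
    """Build p(x, y) = p(x) * p(y | x); p(x) defaults to uniform."""
    if prior is None:
        prior = {x: 1.0 / len(x_values) for x in x_values}
    return {
        (x, y): prior[x] * cond[x][y]
        for x, y in product(x_values, y_values)
    }


# Hypothetical p(Y | X) for one evidence variable X and a label Y.
cond = {
    "low_income": {"approve": 0.2, "reject": 0.8},
    "high_income": {"approve": 0.9, "reject": 0.1},
}
joint = joint_from_conditional(cond, list(cond), ["approve", "reject"])
```

Passing an explicit prior dictionary instead of relying on the
uniform default changes only the p(x) factor of each entry.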


Representation of Mappings 

The relationships between heterogeneous schemas can be
represented as a mapping. We use probabilistic models to
represent mappings. Consistent with the probabilistic
representation of knowledge in a database schema, the attributes
are considered as random variables for non-relational
domains, and the attributes or relations are considered as
first-order random variables for relational domains. Let us denote
the variables in the source as $X = \{X_1, \ldots, X_N\}$ and those
in the target as $X' = \{X'_1, \ldots, X'_M\}$. A mapping is the
conditional distribution $p(X' \mid X)$.

In practice, a mapping is often represented as a set of
source-to-target correspondences

$$\{p(C'_i \mid C_i),\ i = 1, \ldots, I\}$$

where $C_i \subset X$ and $C'_i \subset X'$ are sets of variables in the
source and target, respectively. For the credit card company
example, a mapping between the two schemas may include
the correspondences of “age” and “age,” “salary” and
“annual income,” etc.

In order to obtain a global mapping between the source
and target schemas using the local correspondences, we
make the following two assumptions:

1. $p(C'_i \cup C'_j \mid X) = p(C'_i \mid X)\, p(C'_j \mid X)$, or equivalently, $C'_i \perp C'_j \mid X$;

2. $p(C'_i \mid X) = p(C'_i \mid C_i)$.

From these two assumptions, it follows that:

$$p(X' \mid X) = \prod_i p(C'_i \mid X) = \prod_i p(C'_i \mid C_i)$$

Note that these assumptions are not always correct, but
they provide a good approximation of the global mapping
when it is not available.
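
Under the two assumptions, the global mapping factorizes into
local terms, which the following Python sketch evaluates for one
source/target assignment. The variable names, the probability
tables, and the function global_mapping_prob are hypothetical;
the sketch also assumes that the target variable sets are
disjoint and together cover the target schema.

```python
def global_mapping_prob(x, x_prime, correspondences):
    """p(x' | x) as the product of local terms p(c'_i | c_i).

    correspondences is a list of (source_vars, target_vars, table),
    where table[source values][target values] = p(c'_i | c_i).
    """
    prob = 1.0
    for src_vars, tgt_vars, table in correspondences:
        c = tuple(x[v] for v in src_vars)
        c_prime = tuple(x_prime[v] for v in tgt_vars)
        prob *= table[c].get(c_prime, 0.0)
    return prob


# Two illustrative correspondences: a deterministic one for age and
# a noisy one between salary and annual income (made-up numbers).
correspondences = [
    (("age",), ("age_t",),
     {("young",): {("young",): 1.0}, ("old",): {("old",): 1.0}}),
    (("salary",), ("income",),
     {("low",): {("low",): 0.9, ("high",): 0.1},
      ("high",): {("low",): 0.2, ("high",): 0.8}}),
]
p = global_mapping_prob(
    {"age": "young", "salary": "high"},
    {"age_t": "young", "income": "high"},
    correspondences,
)
```

Here p equals 1.0 * 0.8, the product of the two local
correspondence probabilities.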

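The weight of each formula can be estimated with the log
odds. For example, we define a probabilistic source-to-target
correspondence as $q_S \rightarrow_p q_T$, where $q_S$ and $q_T$ are queries
(i.e., logical formulas) over the source and target schemas or
ontologies, and $\rightarrow_p$ has probabilistic semantics:

$$\Pr(q_T \mid q_S) = p$$

Example 1 (Class correspondence). If x is a graduate
student, then x is a student and older than 24 with probability
0.9, and vice versa.

$$\mathrm{Grad}(x) \rightarrow_{0.9} \mathrm{Student}(x) \wedge \mathrm{Age}(x, y) \wedge (y > 24)$$
$$\mathrm{Grad}(x) \leftarrow_{0.9} \mathrm{Student}(x) \wedge \mathrm{Age}(x, y) \wedge (y > 24)$$

This can be converted to the following weighted formulas, where
the weight is the log odds $\log(0.9/0.1) \approx 2.2$:

$$2.2 \quad \mathrm{Grad}(x) \rightarrow (\mathrm{Student}(x) \wedge \mathrm{Age}(x, y) \wedge (y > 24))$$
$$2.2 \quad \mathrm{Grad}(x) \leftarrow (\mathrm{Student}(x) \wedge \mathrm{Age}(x, y) \wedge (y > 24))$$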

Knowledge Translation 

In this section, we formalize the task of knowledge
translation (KT) and propose a solution to this task. We
have the source knowledge represented as a probabilistic
model $p(X) = p(X_1, \ldots, X_N)$ and a probabilistic mapping
$p(X' \mid X)$. The probabilistic model in the target schema can
be computed as

$$p(X') = \sum_X p(X)\, p(X' \mid X) = \sum_X p(X) \prod_i p(C'_i \mid C_i) \qquad (1)$$

Our goal is to find a compact probabilistic model in the
target schema (i.e., the target knowledge) without using any
source variables as latent variables. This requirement is
motivated by both efficiency (when the knowledge is being used)
and understandability.
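
For a very small domain, the implicit target distribution can be
computed exactly by enumerating the source states. The Python
sketch below does this for a single source variable; the
probabilities and the name target_distribution are illustrative
assumptions, and realistic models would need approximate
inference instead of enumeration.

```python
def target_distribution(p_source, mapping, target_states):
    """p(x') = sum over x of p(x) * p(x' | x), by enumeration."""
    return {
        xp: sum(p_x * mapping[x].get(xp, 0.0) for x, p_x in p_source.items())
        for xp in target_states
    }


# Toy source knowledge and mapping (made-up numbers).
p_source = {"grad": 0.3, "undergrad": 0.7}
mapping = {
    "grad": {"student_over_24": 0.9, "student_24_or_under": 0.1},
    "undergrad": {"student_over_24": 0.2, "student_24_or_under": 0.8},
}
p_target = target_distribution(
    p_source, mapping, ["student_over_24", "student_24_or_under"]
)
```

The resulting p_target contains no source variables, which is the
property required of the target knowledge.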

We also use a log-linear model $q(X')$ to represent this
compact model. A straightforward objective is to minimize
the Kullback-Leibler divergence

$$q^* = \arg\min_q D_{\mathrm{KL}}[p(X') \,\|\, q(X')] = \arg\min_q -\sum_{X'} p(X') \log q(X') \qquad (2)$$
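
As a brief check on the second equality in Equation 2, the
divergence can be expanded using the standard definition (a
known identity, stated here for convenience):

$$D_{\mathrm{KL}}[p(X') \,\|\, q(X')] = \sum_{X'} p(X') \log \frac{p(X')}{q(X')} = \sum_{X'} p(X') \log p(X') - \sum_{X'} p(X') \log q(X')$$

The first term does not depend on $q$, so minimizing the
divergence over $q$ is the same as minimizing the cross-entropy
term $-\sum_{X'} p(X') \log q(X')$.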



The joint distribution $p(X, X')$ is also a log-linear model
(see Equation 1). The weights for a local correspondence can
be computed as:

$$\phi(C_i, C'_i) = \log p(C'_i \mid C_i) = \log \frac{\exp(\theta(C_i, C'_i))}{\sum_{C'_i} \exp(\theta(C_i, C'_i))}$$

where $\theta(C_i, C'_i)$ are the weights of the correspondence in
the probabilistic mapping model. The computation of $p(X')$
is therefore a standard inference task on the joint model
$p(X, X')$.


Parameter Learning. The parameters of the target
log-linear model that minimizes Equation 2 can be computed
via standard optimization algorithms. A simple way to
compute the objective is sampling: we first generate a sample
from the source $p(X)$, and then generate a sample of $X'$
from $p(X' \mid X)$ conditioned on the source sample. In the
relational domain (with Markov logic or other statistical
relational models), each sample instance is a database, and we
need to first decide the number of constants and create a set
of ground variables with these constants.
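
The two-stage sampling procedure can be sketched in Python as
follows. This is a toy illustration with a single source variable
and plain forward sampling, not the Gibbs or MC-SAT samplers used
in the experiments; all names and probabilities are assumptions.
The fitted model is the simplest possible one (a marginal over
one target variable), estimated by maximum likelihood, which
minimizes the sampled version of the objective in Equation 2.

```python
import random
from collections import Counter


def sample_categorical(dist, rng):
    """Draw one key from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return rng.choices(values, weights=probs, k=1)[0]


def sample_target(p_source, mapping, n, seed=0):
    """Draw n target samples: x ~ p(X), then x' ~ p(X' | X = x)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        x = sample_categorical(p_source, rng)
        samples.append(sample_categorical(mapping[x], rng))
    return samples


def fit_marginal(samples):
    """Maximum-likelihood q(X'): relative frequencies of the samples."""
    counts = Counter(samples)
    return {value: c / len(samples) for value, c in counts.items()}


# Toy source knowledge and mapping (made-up numbers).
p_source = {"grad": 0.3, "undergrad": 0.7}
mapping = {
    "grad": {"student_over_24": 0.9, "student_24_or_under": 0.1},
    "undergrad": {"student_over_24": 0.2, "student_24_or_under": 0.8},
}
q = fit_marginal(sample_target(p_source, mapping, n=20000))
```

For this toy input the exact value of p(student_over_24) is
0.3 * 0.9 + 0.7 * 0.2 = 0.41, and the sampled estimate in q
should be close to it.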


Algorithm 1 Structure Translation (MRFs or MLNs)

Input: The source schema $S$, source structure (propositional
or first-order cliques) $\Phi = \{\phi_i\}$, and mapping $M$.

Output: The target structure $\Phi'$.

1: for each variable (or first-order predicate) $P \in S$ that
   does not appear in $M$ do
2:   Let $\Phi_P$ denote all the cliques containing $P$
3:   Remove $\Phi_P$ from $\Phi$
4:   for each pair of cliques in $\Phi_P$ do
5:     Merge the two cliques and remove $P$
6:     Insert the resulting clique back to $\Phi$
7:   end for
8: end for
9: for each clique $\phi \in \Phi$ do
10:   for each variable $P$ in $\phi$ do
11:     Let $P_M$ be all possible correspondences of $P$
12:   end for
13:   Let $\phi'_M$ denote all possible correspondences of $\phi$
14:   $\phi'_M \leftarrow$ Cartesian product of $P_M$
15:   Add $\phi'_M$ to $\Phi'$
16: end for


Structure Learning. The structure of the target
knowledge can also be learned via standard structure learning
algorithms for Markov random fields or Markov logic
networks. An alternative approach is to use heuristics to
generate the structure first. For deterministic one-to-one
correspondences, the independences in the target schema are the
same as those in the source schema up to renaming. If the
correspondences are non-deterministic, we may have fewer
independences in the target schema, and we could have an
extremely complex model with large cliques. Nonetheless,
in realistic scenarios, the correspondences in a mapping are
usually deterministic or nearly deterministic. Therefore, it is
reasonable to treat them as deterministic while inferring
the target structure. In this way we trade off between the
complexity and accuracy of the target knowledge.

First of all, for Markov logic, we use first-order cliques
instead of formulas as the source structure, so that it is
consistent with the propositional case. We show the
pseudocode of the structure translation in Algorithm 1. It can
be viewed as a structure learning process. The first step
(Lines 1-8) is to remove the variables that do not have a
correspondence in the target schema. This can be done by
standard variable elimination (Koller and Friedman 2009;
Poole 2003) without calculating parameters. However, exact
variable elimination may create very large cliques and be
very expensive, especially in Markov logic in relational
domains. Therefore, we approximate it by only merging two
cliques at a time. For the relational case, the merging involves
a first-order unification operation (Russell and Norvig 2003;
Poole 2003). When multiple most general unifiers exist, we
simply include all the resulting new cliques. In the
second step (Lines 9-16), we replace each variable with the
corresponding variables in the target schema. This also
involves first-order unification in the relational case. If there
are many-to-many correspondences, we may generate
multiple target cliques from one source clique.


Example 2. Given the source Markov logic network:

$$\mathrm{Grad}(x) \rightarrow \mathrm{AgeOver25}(x)$$
$$\mathrm{AgeOver25}(x) \rightarrow \mathrm{GoodCredit}(x)$$

and the mapping:

$$2.2 \quad \mathrm{Grad}(x) \vee \mathrm{Undergrad}(x) \leftrightarrow \mathrm{Student}(x)$$
$$3.0 \quad \mathrm{GoodCredit}(x) \leftrightarrow \mathrm{HighCreditScore}(x)$$

We first eliminate AgeOver25(x) from the source
structure because it does not occur in the mapping, and we get a
new clique

{Grad(x), GoodCredit(x)}

Then we translate the clique based on the mapping, which
gives

{Student(x), HighCreditScore(x)}
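
The two steps of Algorithm 1 can be sketched in Python for the
propositional case. This is a simplified illustration under
stated assumptions: cliques are plain sets of variable names, the
mapping is reduced to a dictionary from each source variable to
its corresponding target variables, and first-order unification
is not handled. The function name translate_structure is
hypothetical.

```python
from itertools import combinations, product


def translate_structure(cliques, mapping):
    """Propositional sketch of Algorithm 1.

    cliques: iterable of sets of source variable names.
    mapping: dict from source variable to a list of target variables.
    """
    cliques = {frozenset(c) for c in cliques}
    source_vars = set().union(*cliques) if cliques else set()

    # Step 1: drop variables without a correspondence, merging the
    # cliques that contain them two at a time.
    for var in sorted(source_vars - set(mapping)):
        touching = {c for c in cliques if var in c}
        cliques -= touching
        for a, b in combinations(sorted(touching, key=sorted), 2):
            merged = (a | b) - {var}
            if merged:
                cliques.add(merged)

    # Step 2: replace each variable by its correspondences and take
    # the Cartesian product to form the target cliques.
    target = set()
    for clique in cliques:
        options = [mapping[v] for v in sorted(clique)]
        for combo in product(*options):
            target.add(frozenset(combo))
    return target


# The cliques and correspondences of Example 2.
source_cliques = [{"Grad", "AgeOver25"}, {"AgeOver25", "GoodCredit"}]
mapping = {"Grad": ["Student"], "GoodCredit": ["HighCreditScore"]}
target_cliques = translate_structure(source_cliques, mapping)
```

On this input the function returns the single target clique
{Student, HighCreditScore}, matching the result of Example 2.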

Experiments 

To evaluate our methods, we created two knowledge
translation tasks: one on a non-relational domain (NBA) and one on
a relational domain (University). In each knowledge
translation task, we have two different database schemas as the
source and target schemas and a dataset for each schema. The
input of a knowledge translation system is the source knowledge
and the mapping between the source and target schema. We
obtained the source knowledge (i.e., a probabilistic model
in the source) by running a common learning algorithm
on the source dataset, and created the probabilistic schema
mapping manually. The output of a knowledge translation
system is the target knowledge (i.e., a probabilistic model in
terms of the target schema).




Methods and Baselines 

We evaluate three different versions of our proposed
probabilistic knowledge translation approach described in the
previous section. All of them use the source knowledge and
probabilistic mapping to generate a sampled approximation
of the distribution in the target schema, and all of them use
these samples to learn an explicit distribution in the target
schema. The difference between them is their approach to
knowledge structure. LS-Ks (“learned structure”) learns the
structure directly from the samples, which is the most
flexible approach. TS-Ks (“translated structure”) uses a
heuristic translation of the structure from the source knowledge
base. ES-Ks (“empty structure”) is a simple baseline in
which the target knowledge base is limited to a marginal
distribution.

We also compare to several baselines that make use of
additional data. When there is data Ds in the source schema,
we can use the probabilistic mapping to translate it to the
target schema and learn models from the translated source data.
LS-Ds and MS-Ds learn models from translated source
data, using learned and manually specified structures,
respectively. When there is data Dt in the target schema, we
can learn from this data directly. LS-Dt and MS-Dt learn
models from target data with learned and manually specified
structures, respectively. These methods represent an
unrealistic “best case” since they use data that is typically
unavailable in knowledge translation tasks.

We evaluate our knowledge translation methods
according to two criteria: the pseudo-log-likelihood (PLL) on the
held-out target data, and PLL on the held-out translated
source data. The advantage of the second measure is that
it controls for differences between the source and target
distributions. For relational domains, we use weighted
pseudo-log-likelihood (WPLL), where for each predicate $r$, the PLL
of each of its groundings is weighted by $c_r = 1/g_r$,
where $g_r$ is the number of its groundings.
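
The weighting in WPLL can be illustrated with a short Python
sketch. It assumes that the conditional log-probability of every
grounding given its Markov blanket has already been computed by
some inference routine; the function name wpll, the predicate
names, and the numbers are hypothetical.

```python
import math


def wpll(cond_log_probs):
    """Weighted pseudo-log-likelihood.

    cond_log_probs maps each predicate r to a list with one
    conditional log-probability per grounding of r. Each
    predicate's sum is weighted by c_r = 1 / g_r, where g_r is the
    number of groundings of r.
    """
    total = 0.0
    for log_probs in cond_log_probs.values():
        if log_probs:
            total += sum(log_probs) / len(log_probs)
    return total


# Two predicates with different numbers of groundings.
score = wpll({
    "Advises": [math.log(0.5), math.log(0.5)],
    "Teaches": [math.log(0.25)] * 4,
})
```

Because of the 1/g_r weights, a predicate with many groundings
contributes no more to the score than one with few groundings.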

Non-Relational Domain (NBA) 

We collected information on basketball players in the
National Basketball Association (NBA) from two websites,
the NBA official website nba (as the source schema) and
the Yahoo NBA website yahoo (as the target schema).
The schemas of these two datasets both have the name,
height, weight, position, and team of each player. In these
schemas, the values of position have a different granularity.
Also, in nba, we discretize height and weight into 5
equal-width ranges. In yahoo, we discretize them into 5
equal-frequency ranges (in order to make the mapping more
challenging). The correspondences of these attributes are
originally unit conversion formulas, e.g.,

$$h' = h \times 39.3701$$

After we discretize these attributes, we calculate the
correspondence distribution of the ranges by making a simple
assumption that each value range is uniformly distributed, e.g.,

$$p(h' \in (73.5, 76.5] \mid h \in (1.858, 1.966]) = 0.706$$
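
The range correspondence above can be reproduced with a few lines
of Python under the stated uniformity assumption. The function
name overlap_probability is an assumption for this sketch; the
conversion factor is the one given in the unit conversion
formula.

```python
def overlap_probability(src_range, tgt_range, scale=39.3701):
    """p(h' in tgt_range | h in src_range), with h uniform on src_range.

    src_range is given in source units and tgt_range in target
    units; scale converts source units to target units.
    """
    lo, hi = src_range[0] * scale, src_range[1] * scale
    overlap = max(0.0, min(hi, tgt_range[1]) - max(lo, tgt_range[0]))
    return overlap / (hi - lo)


p = overlap_probability((1.858, 1.966), (73.5, 76.5))
print(round(p, 3))  # 0.706
```

The converted source range is roughly (73.15, 77.40], of which a
length of 3.0 falls inside the target range, giving 0.706 after
rounding.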


We used the Libra Toolkit^1 for creating the source
knowledge and for performing the learning and inference
subroutines required by the different knowledge translation
approaches. We first held out 1/5 of the data instances in the
source and target datasets as test sets. On the
remaining source data, we used decision tree structure
learning (DTSL) (Lowd and Davis 2014) to learn the source
knowledge. We used standard 4-fold cross-validation to
determine the parameters of the learning algorithm. The
parameters include k, prior, and mincount for decision tree
learning, and $\ell_2$ for weight learning.

We use Gibbs sampling for the sampling algorithm in the
knowledge translation approaches. For LS-Ks and TS-Ks,
we draw N samples from the source knowledge probability
distribution. We then use the probabilistic mapping to draw
1 target sample for each source sample. For LS-Ds,
suppose we have Ns instances in the source dataset. We use
the probabilistic mapping to draw N/Ns target samples for
each source instance, such that the total number of target
instances is also N.

LS-Ks and TS-Ks both perform weight learning with an
$\ell_2$ prior. For structure translation with TS-Ks, we only
translate features for which the absolute value of the weight is
greater than a threshold $\theta$. These two parameters are tuned
with cross-validation over a partition of the samples.

See Figure 1 for learning curves comparing our methods
to the baselines. We see that translated knowledge (LS-Ks
and TS-Ks) is as accurate as knowledge learned from
translated source data (LS-Ds) on both the target data and the
translated source data. This confirms that KT can be as
accurate as data translation, but with the advantage of not
requiring any data. We do not see a large difference between
learning the structure (LS-Ks) and heuristically translating
the structure (TS-Ks). As expected, the model learned
directly on the target data (LS-Dt) has the best PLL on the
target data, since it could observe the target distribution
directly.

Relational Domain (University) 

We use the UW-CSE dataset^2 and the UO-CIS dataset,
which we collected from the Computer and Information
Science Department of the University of Oregon. The
UW-CSE dataset was introduced by Richardson and
Domingos (2006) and is widely used in
statistical relational learning research. In this University
domain, we have concepts such as persons, courses, and
publications; attributes such as PhD student stage and course
level; and relations such as advise, teach, and author. The
schemas of the two databases differ in their granularities of
concepts and attribute values. For example, UW-CSE
graduate courses are marked as level 500, while UO-CIS has
both graduate courses at level 600 and combined
undergraduate/graduate courses at level 4/500.

Our methods in this relational domain are similar to those
in the non-relational domain. We use Alchemy^3 for
learning and inference in Markov logic networks. We obtain the
source knowledge by manually creating the formulas in the
source schema and then using the source data to learn the
weights.

Figure 1: PLL for KT methods and baselines on target data (left) and translated source data (right) in the NBA domain.

^1 http://libra.cs.uoregon.edu/

^2 http://alchemy.cs.washington.edu/data/uw-cse/

We use MC-SAT (Poon and Domingos 2006) as the
sampling algorithm for these experiments. Since the behavior of
a Markov logic network is highly sensitive to the number of
constants, we want to keep the number of constants similar
to the original dataset from which the model is learned. We
set the number of constants of each type to be the average
number over all training databases, multiplied by a scalar
for more efficient inference. For methods based on Ks,
we draw N samples from the source distribution and 1
target sample from each source sample and the mapping. For
methods based on Ds, we draw N samples based on the
mapping. Here N does not have to be large, because each
sample instance of a relational domain is itself a database.
We set N to 1, 2, and 5 in our experiments. We set the $\ell_2$
prior for weight learning to 10, based on cross-validation
over samples.

The results are shown in Table 2. In general, learning
MLN structure (LS-Ks and LS-Ds) did not work as well as
their counterparts with manually specified structures
(MS-Ks and MS-Ds). From a single sample, the translated
source data and manually specified structure (MS-Ds) were
more effective than knowledge translation with translated
structure (TS-Ks). However, as we increase the number
of samples, the performance of TS-Ks improves
substantially. With 5 samples, the performance of TS-Ks becomes
competitive with that of MS-Ds, again demonstrating that
knowledge translation can achieve comparable results to
data translation but without data. When evaluated on
translated source data, TS-Ks shows the same trend of
improving with the number of samples, but its performance with 5
relational samples is slightly worse than that of MS-Ds.

^3 http://alchemy.cs.washington.edu/alchemy1.html


Table 2: Evaluation on the target dataset (left) and translated 
source dataset (right) for the university domain. N/A means 
it takes too much time to run. 


Method      WPLL on target              WPLL on source
# Samples   1        2        5         1        2        5
ES-Ks       -'i.ll   -3.76    -3.83     -3.54    -3.44    -3.39
LS-Ks       -12.07   -3.82    -3.48     -9.19    -3.72    -1.51
TS-Ks       -2.51    -2.80    -1.79     -2.05    -2.10    -0.97
LS-Ds       -3.70    -3.01    N/A       -1.23    -1.23    N/A
MS-Ds       -1.94    -1.91    -1.76     -1.22    -0.93    -0.61
LS-Dt       -1.33
MS-Dt       -1.18

Conclusion 

Knowledge translation is an important task towards
knowledge reuse where the knowledge in the source schema
needs to be translated to a semantically heterogeneous
target schema. Unlike data integration and transfer
learning, knowledge translation focuses on the scenario in
which data may be unavailable in both the source and the
target. We propose a novel probabilistic approach for
knowledge translation by combining probabilistic graphical
models with schema mappings. We have implemented an
experimental knowledge translation system and evaluated it on
two real datasets for different prediction tasks. The results
and comparison with baselines show that our approach can
obtain comparable accuracy without data.

The proposed log-linear models, such as Markov random
fields and Markov logic networks, already cover most
common types of knowledge used in data mining. In
future work, we will extend our approach to
knowledge types that are harder to represent as log-linear
models, such as SVMs and nearest neighbor classifiers. This
might require a specialized probabilistic representation.